Data Visualization Portfolio

Intro

Public transportation is a crucial component of Charlottesville’s urban infrastructure. It’s associated with social mobility, urban accessibility, and economic development.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   3.5.1     ✔ stringr   1.5.1
## ✔ lubridate 1.9.4     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(leaflet)
library(lubridate)
library(sf)
## Linking to GEOS 3.13.0, GDAL 3.8.5, PROJ 9.5.1; sf_use_s2() is TRUE
library(leaflet.extras)
library(readr)
df <- read_csv("~/Downloads/Transit_2020.csv")
## Rows: 215407 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Stop, Route, Date_Time, FareCategory, PaymentType
## dbl (5): TransitID, Count, Fare, Latitude, Longitude
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

I imported the data set from “City of Charlottesville Open Data”, a website run by the Charlottesville government that gives the general public access to municipal information. The data set covers two bus systems: CAT (Charlottesville Area Transit) and the UVA transit system (gold, green, orange, and silver lines). It records the stops that the buses make, along with each stop’s coordinates and the time of day.
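
The column-type message that readr prints on import can be avoided by declaring the types up front. The following is a minimal sketch, assuming the ten columns listed in the column specification. The single data row is made up for illustration and does not come from Transit_2020.csv.

```r
library(readr)

# Hypothetical one-row CSV with the same column names as the real file
toy_csv <- I(paste0(
  "Stop,Route,Date_Time,FareCategory,PaymentType,",
  "TransitID,Count,Fare,Latitude,Longitude\n",
  "Main St,Route 7,2020-01-06 08:10:00,Adult,Cash,1,2,0.75,38.03,-78.48\n"
))

# Declare every column type so read_csv() does not have to guess
toy <- read_csv(
  toy_csv,
  col_types = cols(
    Stop = col_character(),
    Route = col_character(),
    Date_Time = col_character(),   # parsed later with ymd_hms()
    FareCategory = col_character(),
    PaymentType = col_character(),
    TransitID = col_double(),
    Count = col_double(),
    Fare = col_double(),
    Latitude = col_double(),
    Longitude = col_double()
  )
)
```

The same `col_types` argument could be passed when reading the real file.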

First Research Question

My first research question was “What’s the busiest hour of a weekday?” I first kept only the records from Monday to Friday. Then I summed the ridership counts for each hour of each date and averaged those totals across dates to get the average ridership for every hour of the day. Finally, I created a bar chart using ggplot with the “classic” theme because I wanted the bar chart to be as neat as possible.
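
To make the steps concrete, here is a small worked sketch on made-up data (the timestamps and counts are invented for illustration, not taken from the transit file). It keeps weekdays, sums counts per date and hour, averages those totals per hour, and then picks the hour with the highest average.

```r
library(dplyr)
library(lubridate)

# Toy records: Monday 2020-01-06, Tuesday 2020-01-07, Saturday 2020-01-04
toy <- tibble(
  Date_Time = ymd_hms(c(
    "2020-01-06 08:10:00", "2020-01-06 08:40:00", "2020-01-07 08:05:00",
    "2020-01-06 21:15:00", "2020-01-07 21:30:00", "2020-01-04 21:00:00"
  )),
  Count = c(2, 3, 1, 4, 6, 50)
)

# With week_start = 7, Sunday is 1, so 2:6 is Monday to Friday
toy_weekdays <- toy |>
  filter(wday(Date_Time, week_start = 7) %in% 2:6)

toy_avg <- toy_weekdays |>
  mutate(Date = as.Date(Date_Time), Hour = hour(Date_Time)) |>
  group_by(Date, Hour) |>
  summarise(Total = sum(Count), .groups = "drop") |>   # total per date and hour
  group_by(Hour) |>
  summarise(Average_Ridership = mean(Total))           # average across dates

# Hour with the highest average ridership
peak <- toy_avg |> slice_max(Average_Ridership, n = 1)
```

In this toy data the Saturday record is dropped, hour 8 averages 3 riders, and hour 21 averages 5, so `peak` holds hour 21. Averaging per date, rather than summing over the whole year, keeps hours comparable even if some dates are missing.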

df <- df |>
  mutate(Date_Time = ymd_hms(Date_Time),
         Hour = hour(Date_Time),
         Date = as.Date(Date_Time))

# wday() numbers days from Sunday = 1 by default, so 2:6 is Monday to Friday
df_weekdays <- df |>
  filter(wday(Date_Time) %in% 2:6)

hourly_totals <- df_weekdays |>
  group_by(Date, Hour) |>
  summarise(Daily_Total_Ridership = sum(Count, na.rm = TRUE), .groups = 'drop')

hourly_avg <- hourly_totals |>
  group_by(Hour) |>
  summarise(Average_Ridership = mean(Daily_Total_Ridership, na.rm = TRUE))

ggplot(hourly_avg, aes(x = Hour, y = Average_Ridership)) +
  geom_col(fill = "steelblue") +
  labs(title = "Weekday Average Ridership by Hour",
       x = "Hour of the Day",
       y = "Average Ridership Per Hour") +
  theme_classic() +
  scale_x_continuous(breaks = seq(0, 23, 1))

The distribution is left-skewed, with a main peak at around 9 pm and a second peak at around 1 am. Surprisingly, the busiest hour turned out to be 9 pm to 10 pm. Because the chart plots average ridership, this means that the buses carry the most riders during this hour, most likely because a large number of students take the bus home or to parties. The 1 am peak can be explained by students taking the bus home from parties or libraries.
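
Since a late-evening peak is unexpected, it is worth checking how the hours were derived. `ymd_hms()` treats timestamps as UTC unless a time zone is given. The source does not say whether the raw timestamps are stored in UTC or in local time, so the following is only a sketch of the check, assuming UTC storage and the "America/New_York" zone for Charlottesville.

```r
library(lubridate)

# A timestamp parsed the same way as in the analysis (UTC by default)
utc_time <- ymd_hms("2020-01-06 21:00:00")

# The same instant expressed in Eastern time (assumption: data stored in UTC)
local_time <- with_tz(utc_time, "America/New_York")

hour(utc_time)    # hour as used in the bar chart
hour(local_time)  # hour on a local clock
```

Under that assumption, hour 21 in the chart would correspond to 4 pm local time in January. If the timestamps are already in local time, no conversion is needed and the chart stands as is.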

Second Research Question

My second research question was “What is the busiest area?” I first downloaded the Charlottesville municipal boundary from the Charlottesville open data site. Then I used the leaflet package, which creates interactive map widgets that can be rendered on HTML pages. I added the boundary to the map and used the coordinates of each stop from the Transit_2020 data set to create a heatmap and a dot density map. In addition, I placed a marker at the coordinates of UVA to see how the UVA population affects where the buses stop.

# Date_Time was already parsed with ymd_hms() in the first section,
# so it is not parsed a second time here.
# Keep records from 21:00 through 22:59 (hours 21 and 22); the name
# df_afternoon is kept because the map code below uses it.
df_afternoon <- df |>
  filter(hour(Date_Time) >= 21 & hour(Date_Time) <= 22)

The heatmap:

df_afternoon <- df_afternoon |>
  select(Latitude, Longitude, Count) |>
  na.omit()

charlottesville_boundary <- st_read("~/Downloads/Municipal_Boundary_Area.geojson") 
## Reading layer `Municipal_Boundary_Area' from data source 
##   `/Users/dylanli/Downloads/Municipal_Boundary_Area.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 1 feature and 5 fields
## Geometry type: POLYGON
## Dimension:     XYZ
## Bounding box:  xmin: -78.52377 ymin: 38.00968 xmax: -78.44636 ymax: 38.07053
## z_range:       zmin: 0 zmax: 0
## Geodetic CRS:  WGS 84
charlottesville_boundary <- st_zm(charlottesville_boundary)

uva_lat <- 38.0336
uva_lon <- -78.5070

heat_map <- leaflet(df_afternoon) |>
  addTiles() |>
  addHeatmap(
    lng = ~Longitude, lat = ~Latitude,
    intensity = ~Count,
    blur = 20, radius = 15, max = max(df$Count, na.rm = TRUE)
  ) |>
  addPolygons(
    data = charlottesville_boundary,
    color = "blue", weight = 2, fillOpacity = 0.1,
    popup = "Charlottesville City Boundary"
  ) |>
  addMarkers(
    lng = uva_lon, lat = uva_lat,
    popup = "University of Virginia",
    label = "University of Virginia",
    labelOptions = labelOptions(noHide = TRUE, direction = "top", textOnly = TRUE)
  ) |>
  setView(lng = -78.5070, lat = 38.0336, zoom = 13)
heat_map

The dot density map:

circle_map <- leaflet(df_afternoon) |>
  addTiles() |>
  addCircles(
    lng = ~Longitude, lat = ~Latitude,
    radius = ~Count * 10,
    weight = 1, color = "red", fillOpacity = 0.5,
    popup = ~paste("Ridership Count:", Count)
  ) |>
  addPolygons(
    data = charlottesville_boundary,
    color = "blue", weight = 2, fillOpacity = 0.1,
    popup = "Charlottesville City Boundary"
  ) |>
  addMarkers(
    lng = uva_lon, lat = uva_lat,
    popup = "University of Virginia",
    label = "University of Virginia",
    labelOptions = labelOptions(noHide = TRUE, direction = "top", textOnly = TRUE)
  ) |>
  setView(lng = -78.5070, lat = 38.0336, zoom = 13)

circle_map

The heatmap shows that UVA is a transit hub, as the hottest (reddest) areas are concentrated around UVA. The dot density map, which draws one circle per record scaled by its ridership count, traces the bus routes in more detail. We can see that the main bus line goes through the downtown corridor and the neighborhoods in the northern part of Charlottesville.
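
The visual impression can be backed up with a ranking of stops by total ridership. The sketch below uses invented stops and counts, so the names and numbers are illustrative only. It assumes the same `Stop`, `Latitude`, `Longitude`, and `Count` columns as the transit data.

```r
library(dplyr)

# Made-up records: several rows per stop, as in the transit data
toy_stops <- tibble(
  Stop = c("A", "A", "B", "C", "C", "C"),
  Latitude = c(38.03, 38.03, 38.04, 38.05, 38.05, 38.05),
  Longitude = c(-78.50, -78.50, -78.49, -78.48, -78.48, -78.48),
  Count = c(5, 7, 3, 1, 1, 2)
)

# One row per stop with its total ridership, busiest first
stop_totals <- toy_stops |>
  group_by(Stop, Latitude, Longitude) |>
  summarise(Total_Count = sum(Count), .groups = "drop") |>
  arrange(desc(Total_Count))
```

A table like `stop_totals` could also be passed to `leaflet()` in place of the raw records, with `radius = ~Total_Count * 10`, so that each stop is drawn as a single circle instead of many overlapping ones.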

After a comparison with the poverty distribution in Charlottesville, we can see that there is generally less bus coverage in poorer neighborhoods such as Woolen Mills, Ridge Street, and Belmont. People who live there need public transportation more, but might not be able to use it because of the lack of infrastructure and funding and the history of gentrification.
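
A comparison like this can be made quantitative by counting ridership inside each neighborhood polygon with a spatial join. The sketch below is not the method used above; it uses two hypothetical square "neighborhoods" and three invented stops, because neighborhood boundaries are not part of the files loaded in this report.

```r
library(sf)

# Helper: 1-degree square polygon with its lower-left corner at (x0, y0)
square <- function(x0, y0) {
  st_polygon(list(rbind(
    c(x0, y0), c(x0 + 1, y0), c(x0 + 1, y0 + 1), c(x0, y0 + 1), c(x0, y0)
  )))
}

# Two hypothetical neighborhoods (not real Charlottesville boundaries)
neighborhoods <- st_sf(
  Neighborhood = c("West", "East"),
  geometry = st_sfc(square(0, 0), square(1, 0), crs = 4326)
)

# Invented stops with ridership counts, converted to point geometries
toy_stops <- st_as_sf(
  data.frame(
    Longitude = c(0.2, 0.5, 1.5),
    Latitude = c(0.5, 0.5, 0.5),
    Count = c(3, 4, 10)
  ),
  coords = c("Longitude", "Latitude"), crs = 4326
)

# Attach the containing neighborhood to each stop, then sum counts per neighborhood
joined <- st_join(toy_stops, neighborhoods, join = st_within)
coverage <- aggregate(Count ~ Neighborhood, data = st_drop_geometry(joined), FUN = sum)
```

With real neighborhood polygons and poverty rates, `coverage` could be joined to the poverty figures to see whether lower-income areas have less recorded ridership.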

Conclusion

In general, this study provides information on the busiest hour and the busiest region. It helped me recognize a flaw in the existing bus system: the poorer neighborhoods are not fully covered. I hope this research shows people what the transit system looks like and brings more attention to vulnerable communities.